@gfx gfx commented Oct 1, 2025

In the Langfuse web UI, output is shown as Unicode escape sequences (\uXXXX). This is the default behavior of Python's json.dumps().

It's annoying for non-English speakers like me, so this PR sets ensure_ascii=False for json.dumps().


Important

Sets ensure_ascii=False in JSON serialization to preserve Unicode characters and adds tests to verify this behavior.

  • Behavior:
    • Set ensure_ascii=False in json.dumps() in _serialize() in attributes.py, _next() and _get_item_size() in score_ingestion_consumer.py, and post() in request.py to preserve Unicode characters.
  • Testing:
    • Added test_unicode_serialization.py to verify Unicode characters are preserved in serialized output.
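
One consequence of the change is relevant to size checks such as `_get_item_size()`: once `ensure_ascii=False` is set, the character count of the serialized string no longer equals its UTF-8 byte count. The sketch below illustrates this with a hypothetical helper, `item_size_bytes`, which is an assumption for illustration and not the SDK's actual implementation.

```python
import json


def item_size_bytes(item: object) -> int:
    """Hypothetical helper: UTF-8 byte size of the JSON form of `item`."""
    return len(json.dumps(item, ensure_ascii=False).encode("utf-8"))


event = {"comment": "こんにちは"}

escaped = json.dumps(event)                        # ASCII-only, uses \uXXXX escapes
preserved = json.dumps(event, ensure_ascii=False)  # keeps the native characters

# Character counts differ between the two forms, and the preserved form
# takes more bytes than characters once encoded as UTF-8.
print(len(escaped), len(preserved), item_size_bytes(event))
```

If a size limit is expressed in bytes, measuring the encoded form rather than `len()` of the string is probably the safer choice.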

This description was created by Ellipsis for 500db8d.

Disclaimer: Experimental PR review

Greptile Overview

Updated On: 2025-10-01 04:39:28 UTC

Summary

This pull request improves Unicode character handling in the Langfuse Python SDK by adding `ensure_ascii=False` to all `json.dumps()` calls throughout the codebase. The change affects three critical serialization points: the OpenTelemetry attributes serializer (`_serialize` function in `attributes.py`), the HTTP client for API requests (`request.py`), and the score ingestion consumer for batch processing (`score_ingestion_consumer.py`).

By default, Python's json.dumps() escapes non-ASCII characters as Unicode sequences (e.g., \u3053\u3093\u306b\u3061\u306f instead of こんにちは), making output unreadable for non-English users in the Langfuse web UI. This change preserves Unicode characters in their native form, significantly improving the user experience for international users working with Japanese, Chinese, Korean, Arabic, Russian, and other non-Latin scripts.
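
The difference described above can be reproduced with the standard library alone. This is a minimal sketch; the variable names are illustrative and not taken from the SDK.

```python
import json

text = "こんにちは"

default_output = json.dumps(text)                        # escapes non-ASCII characters
preserved_output = json.dumps(text, ensure_ascii=False)  # keeps characters as-is

print(default_output)    # "\u3053\u3093\u306b\u3061\u306f"
print(preserved_output)  # "こんにちは"
```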

The PR includes a comprehensive test file (test_unicode_serialization.py) that validates Unicode preservation across multiple writing systems and emoji. The change is backward-compatible as the resulting JSON remains valid, and the modification is applied consistently across all serialization points to ensure uniform behavior throughout the SDK.
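
As a rough illustration of the backward-compatibility claim, the sketch below checks that both output forms decode to the same object and that the original strings survive verbatim only when `ensure_ascii=False` is set. The sample strings are assumptions for illustration and are not copied from test_unicode_serialization.py.

```python
import json

# Illustrative samples covering several scripts and an emoji.
samples = {
    "japanese": "こんにちは",
    "chinese": "你好",
    "korean": "안녕하세요",
    "arabic": "مرحبا",
    "russian": "Привет",
    "emoji": "🎉",
}

escaped = json.dumps(samples)
preserved = json.dumps(samples, ensure_ascii=False)

# Both forms are valid JSON and decode to the same object.
assert json.loads(escaped) == json.loads(preserved) == samples

# Only the ensure_ascii=False output contains the original strings verbatim.
for value in samples.values():
    assert value in preserved
    assert value not in escaped
```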

Important Files Changed

Changed Files

| Filename | Score | Overview |
|---|---|---|
| langfuse/_client/attributes.py | 5/5 | Added ensure_ascii=False to json.dumps() in the _serialize function used for OpenTelemetry span attributes |
| langfuse/_utils/request.py | 5/5 | Added ensure_ascii=False to json.dumps() in the HTTP client used for all API requests to Langfuse |
| langfuse/_task_manager/score_ingestion_consumer.py | 5/5 | Added ensure_ascii=False to two json.dumps() calls in the score ingestion batch processing pipeline |
| tests/test_unicode_serialization.py | 5/5 | New comprehensive test file validating Unicode character preservation across multiple languages and emoji |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk as it only changes JSON serialization format without affecting data structures or API contracts
  • Score reflects the backward-compatible nature of the change and comprehensive test coverage for Unicode handling
  • No files require special attention as all changes are straightforward serialization parameter additions

Sequence Diagram

sequenceDiagram
    participant User
    participant LangfuseClient as "Langfuse Client"
    participant EventSerializer as "Event Serializer"
    participant JSONEncoder as "JSON Encoder"
    participant APIEndpoint as "API Endpoint"

    User->>LangfuseClient: "serialize data with unicode content"
    LangfuseClient->>EventSerializer: "_serialize(data)"
    EventSerializer->>JSONEncoder: "json.dumps(obj, cls=EventSerializer, ensure_ascii=False)"
    JSONEncoder-->>EventSerializer: "serialized json string with preserved unicode"
    EventSerializer-->>LangfuseClient: "unicode-preserved json string"
    
    User->>LangfuseClient: "batch_post(**kwargs)"
    LangfuseClient->>EventSerializer: "json.dumps(kwargs, cls=EventSerializer, ensure_ascii=False)"
    EventSerializer-->>LangfuseClient: "serialized data with unicode preserved"
    LangfuseClient->>APIEndpoint: "POST request with unicode content"
    APIEndpoint-->>LangfuseClient: "response"
    LangfuseClient-->>User: "response"

    User->>LangfuseClient: "upload score events"
    LangfuseClient->>EventSerializer: "serialize events with unicode"
    EventSerializer->>JSONEncoder: "json.dumps(event, cls=EventSerializer, ensure_ascii=False)"
    JSONEncoder-->>EventSerializer: "unicode-preserved serialization"
    EventSerializer-->>LangfuseClient: "serialized events"
    LangfuseClient->>APIEndpoint: "upload batch with unicode content"
    APIEndpoint-->>LangfuseClient: "upload response"
    LangfuseClient-->>User: "upload complete"
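
The diagram's `json.dumps(..., cls=EventSerializer, ensure_ascii=False)` calls combine a custom encoder with Unicode preservation. A minimal sketch of that combination follows; `SketchEncoder` is a hypothetical stand-in and does not reproduce the SDK's `EventSerializer`, and the final UTF-8 encoding step is an assumption about how a request body might be prepared.

```python
import json
from datetime import datetime, timezone


class SketchEncoder(json.JSONEncoder):
    """Hypothetical stand-in for a custom encoder; not the SDK's EventSerializer."""

    def default(self, o):
        # Serialize datetimes as ISO 8601 strings; defer everything else.
        if isinstance(o, datetime):
            return o.isoformat()
        return super().default(o)


event = {
    "name": "評価",
    "timestamp": datetime(2025, 10, 1, tzinfo=timezone.utc),
}

# ensure_ascii is forwarded to the encoder class, so both options apply together.
payload = json.dumps(event, cls=SketchEncoder, ensure_ascii=False)

# Encode explicitly so the bytes on the wire are UTF-8.
body = payload.encode("utf-8")

print(payload)
```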

CLAassistant commented Oct 1, 2025

CLA assistant check
All committers have signed the CLA.

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

gfx commented Oct 1, 2025

Ah, this is exactly the same as #1330. Closing.

@gfx gfx closed this Oct 1, 2025
@gfx gfx deleted the gfx/do_not_encode_unicode_in_outputs branch October 1, 2025 06:01